{
 "metadata": {
  "name": "",
  "signature": "sha256:384e4f2ba200558f9132d6aa2cb0f0b1ebe46d58e925d6560e6ae19bccc12b6f"
 },
 "nbformat": 3,
 "nbformat_minor": 0,
 "worksheets": [
  {
   "cells": [
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "# Iterators, Generators, and MapReduce(rators)\n",
      "\n",
      "2014-08-14, Josh Montague\n",
      "\n",
      "\n",
      "In this notebook, we'll walk through some of the details behind how Python lets us easily iterate over containers, the guidelines to follow when creating a custom iterable, and how the iterator concept extends to generators. We'll get just a taste of generators (which are very powerful!), and then see them in action with a simple example of using the MapReduce framework. \n",
      "\n",
      "------\n",
      "\n",
      "This session was built using:\n",
      "\n",
      "- Python 2.7\n",
      "- IPython 2\n",
      "- mrjob 0.4 (optional)\n"
     ]
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "## Iterables and Iterators\n",
      "\n",
      "\n",
      "### Common Usage\n",
      "\n",
      "Let's start with examples of how we already use these things. Lists are iterables. You can use a ``for`` loop to iterate over iterables and you're already falling asleep because you know this. Bare with me!"
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "l = [1, 2, 3] \n",
      "\n",
      "for x in l:\n",
      "    print x"
     ],
     "language": "python",
     "metadata": {},
     "outputs": []
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "Similarly, you're almost certainly familiar with iterating over ``dicts``, ``strs``, ``tuples``, and files:"
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "s = 'josh'\n",
      "d = {'j':1, 'o':2, 's':3, 'h':4}\n",
      "t = (1, 2, 3)\n",
      "f = open('moby.txt')\n",
      "\n",
      "print '*'*10 + ' str ' + '*'*10\n",
      "for x in s:\n",
      "    # iterate over characters, x is a str\n",
      "    print x\n",
      "\n",
      "print    \n",
      "print '*'*10 + ' dict ' + '*'*10    \n",
      "for x in d:   # remember: not in any particular order!\n",
      "    # iterate over dict keys, x is a str\n",
      "    print x   \n",
      "    \n",
      "print     \n",
      "print '*'*10 + ' tuple ' + '*'*10    \n",
      "for x in t:\n",
      "    # iterate over members, x is an int\n",
      "    print x\n",
      " \n",
      "print     \n",
      "print '*'*10 + ' file ' + '*'*10    \n",
      "for x in f:\n",
      "    # iterate over lines, x is a str\n",
      "    print '> ' + x[:50] + '...'     "
     ],
     "language": "python",
     "metadata": {},
     "outputs": []
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "There are also a bunch of constructors and built-in functions take iterables as arguments:"
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "# strings are iterables\n",
      "print list(\"hello\")\n",
      "print\n",
      "\n",
      "# range() returns an iterable\n",
      "y = [ 2*x for x in range(10) ]\n",
      "\n",
      "# lists are iterables\n",
      "print sum(y)"
     ],
     "language": "python",
     "metadata": {},
     "outputs": []
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "### Why? Iterator Protocol!\n",
      "\n",
      "The reason all of these common usages work so nicely is that the built-in objects all obey the \"iterator protocol\". The terminology can be a little confusing, so [here are](https://docs.python.org/2/glossary.html#term-iterable) some [relevent links](https://docs.python.org/2/library/stdtypes.html#iterator-types) to the official glossary in the docs. The exact definitions are not the point, but they are helpful to start wrapping your head around. Here are some of the key points:\n",
      "\n",
      "- **iterable**: an object capable of returning its members one at a time, has internal  methods like ``__getitem__()``, ``__iter__()`` that deal with usage; think ``list``, ``dict``, ``tuple``, ``str``.\n",
      "\n",
      "- **iterator**: an object representing a stream of data, really a subset of iterables that require a ``next()`` method, and raise a ``StopIteration`` exception when the object is \"out\". \n",
      "\n",
      "These internal methods and exception handling mean that as long as an object is built properly, using it in something like a ``for`` loop will be transparently easy. Python does an amazing job of hiding all of this from you when you're getting started. All of the built-in types obviously have these things, prebaked. \n",
      "\n",
      "We write code that looks like this:\n",
      "\n",
      "    for x in obj:\n",
      "        # do stuff with x\n",
      "\n",
      "And Python does something like this:\n",
      "\n",
      "    _i = obj.__iter__()\n",
      "    while True:\n",
      "        try:\n",
      "            x = _i.next()\n",
      "        except StopIteration:\n",
      "            break\n",
      "        # do stuff with x\n",
      "\n",
      "By combining those abbreviated definitions above and the example code just above, we can see that our typical ``for`` loop takes an *iterable* ``obj`` and creates an temporary *iterator* ``_i`` from it, which is used within the loop.\n",
      "\n",
      "So, iterables and iterators will have different attributes. Let's verify this with the awesome ``dir()`` function. "
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "# l is a list, an iterable\n",
      "l = [1,2,3]\n",
      "\n",
      "# create an iterator object from l\n",
      "li = iter(l)"
     ],
     "language": "python",
     "metadata": {},
     "outputs": []
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "# compare the methods and attributes\n",
      "print \"**** iterable ****\"\n",
      "print\n",
      "print l.__class__\n",
      "print\n",
      "print dir(l)\n",
      "\n",
      "print\n",
      "print\n",
      "print \"**** iterator ****\"\n",
      "print\n",
      "print li.__class__\n",
      "print\n",
      "print dir(li)"
     ],
     "language": "python",
     "metadata": {},
     "outputs": []
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "Now we can also look a little closer at how the ``.next()`` method works by explicitly calling it on an iterator that we've created:"
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "# l is an iterable\n",
      "l = [\"ready\", \"set\", \"go\"]\n",
      "\n",
      "# i is an iterator on the list iterable\n",
      "i = iter(l)\n",
      "\n",
      "i"
     ],
     "language": "python",
     "metadata": {},
     "outputs": []
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "For this cell where we'll use the ``next()`` function, use the in-place evaluation ``<CTL + return>``, and see what the iterator returns each time."
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "i.next()"
     ],
     "language": "python",
     "metadata": {},
     "outputs": []
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "So, now you can better appreciate how the ``for`` loop works so nicely... it's built to handle exactly this behavior. Here's that code snippet from above again:\n",
      "\n",
      "    for x in obj:\n",
      "        # do stuff with x\n",
      "\n",
      "Under the hood:\n",
      "\n",
      "    _i = obj.__iter__()    # this is what iter(obj) actually does\n",
      "    while True:\n",
      "        try:\n",
      "            x = _i.next()\n",
      "        except StopIteration:\n",
      "            break\n",
      "        # do stuff with x"
     ]
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "### I got 99 problems...\n",
      "\n",
      "#### ... but not 99^99; that's too many problems to iterate over.\n",
      "\n",
      "Ok, so now hopefully you have a deeper appreciation for how the built-in objects in Python let you use loops and other constructs for iterating. \n",
      "\n",
      "\n",
      "There are (at least) two ways that you can have trouble using containers that step through a sequence of members:\n",
      "\n",
      "- you have custom data types for containers or members (or both)\n",
      "- the data structures are built-in, but the sequence is incredibly large (or infinite)\n",
      "\n",
      "In the first case, imagine want to create a special iterator object that represents a \"countdown\" (this example is all over the place online). To do this, you'd need to write a class that has (at a minimum):  ``__init__()``, ``__iter__()``, and ``next()`` methods such that you can drop your object into a ``for`` loop: ``for x in countdown(5): ...`` That's a lot of overhead for something that doesn't seem very complicated. Hold that thought.\n",
      "\n",
      "In the second case, imagine writing a function so I end up with a list of all the integers below 99. No problem; keep assembling the list (``the_list.append()``) while incrementing a counter somewhere, and then at the end you: ``return the_list``. Now, please extend the function so I end up with a list of all integers below 10^20; or how about a never-ending list of integers, like a stream? Trouble. Most likely trouble due to lack of memory; we can't build a list that large in memory and then ``return`` it. \n",
      "\n",
      "This brings us to...\n"
     ]
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "## Generators / generator functions\n",
      "\n",
      "*It seems that people most often refer an instance of \"a generator function\" as \"a generator\", so be aware of the potential for naming ambiguity.*\n",
      "\n",
      "A generator is a nicer and more convenient way to solve the problems outlined above, and more! Generators looks like ordinary functions but include the ``yield`` keyword, and generally behave quite differently. In fact, *they're functions that return an iterator.*\n",
      "\n",
      "Importantly, **generators yield processing control while maintaining state and position.** This is in contrast to the manner in which functions enter, execute their code all the way to either an Exception or a ``return`` statement, and subsequently lose all local variables and values (the \"state\"). \n",
      "\n",
      "Some other key things about generators:\n",
      "\n",
      "- on ``next()`` call, the generator function executes all the way to the next ``yield``, returns the appropriate value, and then pauses until the next call to ``next()``\n",
      "- when \"completed\", the generator raises the ``StopIteration`` exception, as expected (as saw for an iterator)\n",
      "- the generator object only calculates / returns one value at a time, regardless of the length of the series (lazy evaluation)\n",
      "\n",
      "Conceptually, you can think of these things as being somewhat like subsets (but not exactly):\n",
      "\n",
      "``[ iterables  [ iterators  [ generators ] ] ] ``\n",
      "\n",
      "Let's go through some of the characteristics of these things, by implementing the countdown object we talked about earlier. Note that there are none of the class methods we said we'd need to do this as a pure iterator!"
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "# countdown is a generator function. note the \"yield\"\n",
      "def countdown(n):\n",
      "    print \"starting countdown from %s....\" % n\n",
      "    while n > 0: \n",
      "        yield n\n",
      "        print \"decrementing count...\"\n",
      "        n -= 1"
     ],
     "language": "python",
     "metadata": {},
     "outputs": []
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "No code executes within the generator function when we *make* the object:"
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "x = countdown(10)\n",
      "\n",
      "x"
     ],
     "language": "python",
     "metadata": {},
     "outputs": []
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "Remember that the generator function returns a generator, which behaves like an iterator. So we already know how to get items out of an iterator using the ``next()`` method. Importantly: note the flow of execution by looking for the ``print`` statements:"
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "x.next()"
     ],
     "language": "python",
     "metadata": {},
     "outputs": []
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "Note the first thing that executes is the ``print`` statement that immediately follows the ``yield`` in the generator function. Execution picks up at this point every time we call ``next()``. When the generator is empty, it raises the expected ``StopIteration`` exception."
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "# use the CTL-return execution on this cell\n",
      "x.next()"
     ],
     "language": "python",
     "metadata": {},
     "outputs": []
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "Clearly this object is set up properly to use in a ``for`` loop. I'll remove the print statements this time:"
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "def quiet_countdown(n):\n",
      "    while n > 0: \n",
      "        yield n\n",
      "        n -= 1\n",
      "        \n",
      "for x in quiet_countdown(10):\n",
      "    print x"
     ],
     "language": "python",
     "metadata": {},
     "outputs": []
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "The other example we gave in which iterators fall short is where the end result (the series) is very long. Imaging that we'd like a way to return a long series of numbers, such that we can sum them for a final answer. Compare the two following ways of accumulating the series, one using a list and one using a generator:"
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "def first_n_list(n):\n",
      "    current, result = 0, list()\n",
      "    while current < n:\n",
      "        result.append(current)     # result will have to be n items long\n",
      "        current += 1\n",
      "    return result\n",
      "\n",
      "def first_n_gen(n):\n",
      "    current = 0\n",
      "    while current < n:\n",
      "        yield current              # there is only one number, current\n",
      "        current += 1"
     ],
     "language": "python",
     "metadata": {},
     "outputs": []
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "In both cases, the result can be used by ``sum`` to find the answer:\n",
      "\n",
      "    answer = sum( first_n_list(100000000) )\n",
      "    answer = sum( first_n_gen(100000000) )\n",
      "    \n",
      "The reason that both of these expressions look the same is that the ``sum`` method takes care of asking each iterator for its next item, increments its internal result, and proceeds until the end of the iterator. However, the difference is in the flow of control. For the list-based function, this is approximately:\n",
      "\n",
      "- ``sum`` expects an iterator, asks ``first_n_list`` to evaluate\n",
      "- ``first_n_list`` evaluates *in its entirety*, returns a 100-million item list\n",
      "- ``sum`` iterates through the list, incrementing its internal result each time\n",
      "- the giant list raises a ``StopIteration`` exception to ``sum``, who now knows its done, returns result to ``answer``\n",
      "\n",
      "But for the generator function version:\n",
      "\n",
      "- ``sum`` expects an iterator, asks ``first_n_gen`` to evaluate\n",
      "-  ``first_n_gen`` begins execution until ``yield``, returns a **single value and control** to ``sum``, pauses\n",
      "- ``sum`` receives first value, increments internal counter, asks generator for ``next()`` \n",
      "- ``first_n_gen`` resumes execution until ``yield``,  returns value *and control* ... \n",
      "- ... and so on, until ... \n",
      "- ``first_n_gen`` raises its ``StopIteration`` exception to ``sum``, who now knows its done, returns result to ``answer``\n",
      "\n",
      "So both versions needed to keep approximately three important things in memory: the current result within ``sum``, the current result from the ``first_n_*`` method, and then one more. However, for the list-based approach, that extra thing was a 100 million-item list; for the generator, it was a single integer."
     ]
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "## A Tale of Two (Or Three) Generator Functions"
     ]
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "Imagine that we have a module called ``word_count.py`` that looks something like the code below. Note the ``yield``s in the three generator functions! \n",
      "\n",
      "```\n",
      "class MRWordFrequencyCount(MRJob):\n",
      "\n",
      "    def mapper(self, _, line):\n",
      "        for word in WORD_RE.findall(line):\n",
      "            yield word.lower(), 1\n",
      "\n",
      "    def combiner(self, word, counts):\n",
      "        yield word, sum(counts)\n",
      "\n",
      "    def reducer(self, word, counts):\n",
      "        yield word, sum(counts)\n",
      "\n",
      "```\n",
      "\n",
      "Ok, now you can stop imagining, because this code is actually here in the repo! This is the prototypical introduction to the MapReduce framework. It's certainly one of the most common snippets of code in existence (maybe second to ``hello world``). \n",
      "\n",
      "\n",
      "Specifically, this code is from the [Python library](https://pythonhosted.org/mrjob/guides/quickstart.html#) ``mrjob`` (em-arr-job). This is one of a handful of Python wrappers for using MapReduce; ``mrjob`` has proven useful for the ability to write your code once, run the code locally to look for errors or unexpected behaviors, and then seamlessly launch a distributed job on AWS EC2 instances with only a couple of command-line parameter changes, no changes to your code needed. That said, when it comes to debugging errors that come from Hadoop... you're on your own :)\n",
      "\n",
      "If you don't have it installed and want to do this last part, go ahead and \n",
      "\n",
      "    pip install mrjob\n",
      "    \n",
      "in your favorite Python environment. \n",
      "\n",
      "Let's drop to the shell and run this simple job, locally, on the first few chapters of Moby Dick, and see all those generators in action!"
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "# A one-liner that leaves an IPython notebook to run \n",
      "#     a local MapReduce job on your laptop. Awesome!\n",
      "!python word_count.py moby.txt 2>/dev/null"
     ],
     "language": "python",
     "metadata": {},
     "outputs": []
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "### What's Happening (the short version)?\n",
      "\n",
      "``mrjob`` is doing a lot of work for you, mostly behind the scenes. And I'll be honest: there are still a few corners of this process that seem like magic to me. If you want to start digging, you can always [check the docs](https://pythonhosted.org/mrjob/guides/concepts.html). However, we can still talk about how this process uses generators. \n",
      "\n",
      "For each line in the input file, the mapper tokenizes the line with a regular expression, and yields (for example) ``(\"whale\", 1)`` for each word. The part that's \"invisible\" here, is that the counts for each key are added to a generator, such that we end up with an input to the combiner that looks like ``(\"whale\", <generator of counts for each line>)``. The combiner then sums all the values (from one mapper) in the list and yields ``(\"whale\", 8)``. Similarly, each of these counts is added to a generator such that the reducer does the same task as the combiner. In this locally-run example, the combiner and the reducer are redundant. But in a distributed process, there will be many individual combiners that yield the same keys, so the final reducer will aggregate all of these tallies on the same key. Finally, the reduces yields the final tally ``(\"whale\", 14)``.\n",
      "\n",
      "At each step, the control passing between these processes means that each new line of the input file becomes a new task for the ``mapper-combiner-reducer`` sequence.\n",
      "\n",
      "*Though definitely out of the scope of this session on Python, MapReduce is a  powerful framework with which it's helfpul to be familiar if you're in the data space. The [wikipedia article](http://en.wikipedia.org/wiki/MapReduce) is a good place to start.*\n",
      "\n",
      "\n",
      "## The End"
     ]
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "--------"
     ]
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "Oh, you didn't get enough of iterators and generators? Well, here are some sneak previews of concepts that didn't get included in the main session. Consider these your starting points for googling away at your own pace..."
     ]
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "### Passing values *into* the scope of a generator\n",
      "\n",
      "You can actually pass values into the evaluation of generator while it is paused at a ``yield``. Those values can then be used later in the execution!"
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "def knock_knock():\n",
      "    name = yield \"Who's there?\"\n",
      "    print type(name)\n",
      "    yield \"%s who?\" % name\n",
      "    yield \"The end.\"\n",
      "    \n",
      "k = knock_knock()"
     ],
     "language": "python",
     "metadata": {},
     "outputs": []
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "k.next()"
     ],
     "language": "python",
     "metadata": {},
     "outputs": []
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "k.send('Josh')"
     ],
     "language": "python",
     "metadata": {},
     "outputs": []
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "k.next()"
     ],
     "language": "python",
     "metadata": {},
     "outputs": []
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "k.next()"
     ],
     "language": "python",
     "metadata": {},
     "outputs": []
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "It turns out that ``yield`` actually returns ``None`` when used in a generator. So, that return value (or any other) can be stored in another variable. Above, ``name`` captures the return value of ``yield``. The ``send()`` generator method resumes execution of the generator and allows the ``yield`` statement to return a different value."
     ]
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "### Generator Expressions\n",
      "\n",
      "Just like list comprehensions allow us to execute the equivalent of a ``for`` loop,\n",
      "\n",
      "    for x in i:\n",
      "        foo(x)\n",
      "        \n",
      "    # ^ gives the same results as...\n",
      "    [ foo(x) for x in i ]\n",
      "\n",
      "generator expressions can be used to replace similar loops:\n",
      "\n",
      "    for x in i:\n",
      "        yield x\n",
      "        \n",
      "    # ^ gives the same result as...\n",
      "    ( x for x in i )\n",
      "\n",
      "\n",
      "In a generator expression the ``yield`` is implicit. And just like a generator, the statement is evaluated in a lazy manner: one item at a time, as needed."
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "gen_exp = ( x for x in range(1000))\n",
      "\n",
      "gen_exp"
     ],
     "language": "python",
     "metadata": {},
     "outputs": []
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "sum(gen_exp)"
     ],
     "language": "python",
     "metadata": {},
     "outputs": []
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "## References \n",
      "\n",
      "There are many great references online for learning more about these concepts. Below are a few of the links where I found examples and some of the thoughts shared above:\n",
      "\n",
      "- [Iterator types](https://docs.python.org/2/library/stdtypes.html#iterator-types) (Python docs)\n",
      "- ``mrjob`` [Concepts](https://pythonhosted.org/mrjob/guides/concepts.html#)\n",
      "- David Beazley's [slide deck](http://www.slideshare.net/dabeaz/python-generator-hacking) \"Hacking Generators\"\n",
      "- Jeff Knupp's [post on generators](http://www.jeffknupp.com/blog/2013/04/07/improve-your-python-yield-and-generators-explained/)\n",
      "- [The Python Practice Book](http://anandology.com/python-practice-book/iterators.html)"
     ]
    }
   ],
   "metadata": {}
  }
 ]
}